Final Project: Predicting Diabetes from Medical Records

Overview

1: Introduction and Data Collection

2: Data Processing

3: Exploratory Data Analysis (EDA) & Visualization

4: Machine Learning

Conclusion

1 Introduction

Diabetes is a massive problem in American society. Though people can live with it, they must be very diligent in managing their health. Cases are rising in the United States at an alarming rate: according to the CDC, an estimated 34.2 million Americans have diabetes (https://www.diabetesresearch.org/file/national-diabetes-statistics-report-2020.pdf).

Our main objective is to analyze a dataset of attributes related to diabetes and to predict whether a given person is diabetic. We will apply machine learning algorithms to achieve this goal.

1.1 Libraries

These are the libraries used in this tutorial.
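A typical import cell for this pipeline might look like the following; the exact set is an assumption inferred from the tools used later in the tutorial (pandas for data handling, scikit-learn for the LDA, random forest, and decision tree models):

```python
# Core libraries for this tutorial (any reasonably recent
# numpy / pandas / matplotlib / scikit-learn should work).
import numpy as np                 # numerical arrays, z-scores
import pandas as pd                # DataFrame handling
import matplotlib.pyplot as plt    # histograms, box plots, heatmaps

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
```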

1.2 Data Used

In the data collection stage of the data life cycle, the focus is on gathering data from sources such as websites and databases.

We have found data from Kaggle at: https://www.kaggle.com/datasets/mathchi/diabetes-data-set

This dataset has all the attributes needed for predicting diabetes, which will aid us in reaching our final goal. The data describes female patients at least 21 years old of Pima Indian heritage. It is organized with the following attributes, as described by the website:

Pregnancies: Number of times pregnant

Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

BloodPressure: Diastolic blood pressure (mm Hg)

SkinThickness: Triceps skin fold thickness (mm)

Insulin: 2-Hour serum insulin (mu U/ml)

BMI: Body mass index (weight in kg/(height in m)^2)

DiabetesPedigreeFunction: Diabetes pedigree function

Age: Age (years)

Outcome: Class variable (0 or 1)
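Loading the CSV is a one-liner with pandas. A minimal sketch, assuming the Kaggle file is saved as `diabetes.csv` next to the notebook; the two inline rows below are just a stand-in so the example is self-contained:

```python
import io
import pandas as pd

# In the real notebook: df = pd.read_csv("diabetes.csv")
# Here we parse a two-row stand-in with the same columns so the sketch runs anywhere.
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)              # (2, 9) -- the full file is (768, 9)
print(df.columns.tolist())
```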

As you can see from the data shown above, each individual has a numeric ID and a value for each attribute.

Outcome, the most important piece of data, is the last column.

As we can see, there are 768 individual data points. It is interesting how much some attributes vary; for example, skin thickness ranges from 23 at the 50th percentile to 99 at the maximum.

Another thing to note is that there are no abnormal values beyond what we would expect to see in diabetics; for example, the maximum blood pressure is 122, which is within a normal range and isn't considered harmful.
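Summary statistics like the percentiles and maxima quoted above come from pandas' `describe()`; a sketch on a small stand-in frame (the notebook runs this on the full 768-row dataset):

```python
import pandas as pd

# Stand-in values; in the notebook this is the full dataset.
df = pd.DataFrame({"SkinThickness": [23, 35, 0, 99],
                   "BloodPressure": [72, 66, 64, 122]})

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
stats = df.describe()
print(stats.loc[["50%", "max"]])
```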

2 Data Processing

2.1 Data Observations

Now that we have gathered our data, we need to make it usable for analysis. We will make a number of changes and perform data tidying. In our case the dataset was selected to contain exactly what we need to predict diabetic cases, so we will not need to remove rows or columns. However, there are a good number of missing values that we will need to handle.

Let's first take a look at the data.

As we can see, there is a good split among the people in this dataset: 268 have confirmed cases of diabetes and 500 do not, an approximately 35% positive / 65% negative split.
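The class split can be read off with `value_counts()`; a sketch using a stand-in Outcome column with the same 500/268 composition:

```python
import pandas as pd

# Stand-in Outcome column; the real one has 500 zeros and 268 ones.
outcome = pd.Series([1] * 268 + [0] * 500, name="Outcome")

counts = outcome.value_counts()                        # absolute counts
ratios = outcome.value_counts(normalize=True).round(3) # fractions of the total
print(counts.to_dict())   # {0: 500, 1: 268}
print(ratios.to_dict())   # {0: 0.651, 1: 0.349}
```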

As we can see from the summary statistics of the combined, positive, and negative data, there are large differences between them. This shows that we cannot rely on the combined data alone to predict diabetic cases, and we will have to analyze the groups separately. We will visualize and examine the dataset more closely later.

A z-score indicates where each point lies within a distribution: it uses the mean and standard deviation to measure how far from the mean the point is. Most of the data above falls between z-scores of -3 and 3, so we will use that range as our cutoff for outliers.
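A sketch of the z-score cutoff, on a stand-in column with one obvious outlier (the real notebook applies this across all numeric predictors):

```python
import pandas as pd

# Stand-in column: fourteen typical values plus one extreme reading.
df = pd.DataFrame({"Insulin": [88, 92, 95, 97, 99, 100, 100, 101,
                               103, 105, 107, 110, 112, 115, 850]})

# z = (x - mean) / std; keep rows whose z-score magnitude is at most 3.
z = (df - df.mean()) / df.std()
filtered = df[(z.abs() <= 3).all(axis=1)]
print(len(df), "->", len(filtered))   # 15 -> 14: the 850 reading is dropped
```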

As you can see, this has reduced the data by quite a bit; however, the class ratios are very similar to where they were before.

2.2 Missing Values

When looking at our dataframe, we saw that some features contain 0s where a zero value makes no physical sense. We will replace these values with NaN.
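A sketch of the replacement, assuming the zero-coded columns are Glucose, BloodPressure, SkinThickness, Insulin, and BMI (the columns where a literal 0 is physiologically impossible):

```python
import numpy as np
import pandas as pd

# Columns where 0 really means "not recorded".
zero_means_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Stand-in frame; in the notebook this is the full dataset.
df = pd.DataFrame({"Glucose": [148, 0, 85], "BloodPressure": [72, 0, 66],
                   "SkinThickness": [35, 0, 29], "Insulin": [0, 94, 168],
                   "BMI": [33.6, 0.0, 26.6]})

# Recode the sentinel zeros as NaN so imputation can find them later.
df[zero_means_missing] = df[zero_means_missing].replace(0, np.nan)
print(df.isna().sum())
```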

Using the boxplot below, we can see that missing values are no longer recorded as 0s; they are now encoded as NaN.

We now have a number of missing values to fill. The missingness doesn't appear truly random, as it isn't evenly spread among the affected attributes. However, there isn't enough information about how this data was recorded to assume anything out of the ordinary, so we will treat the values as missing at random (MAR).

There are many imputation strategies, but we will use median imputation (similar to mean imputation) since we have seen how varied this data can be and how likely outliers are; the median is robust to them.

3 Exploratory Data Analysis (EDA) & Visualization

This is the part of the pipeline called exploratory analysis. At this stage we want to observe any possible trends. We can apply statistical analysis to better support our observations and find evidence of the trends found.

In our case it is convenient to finish imputing variables while visualizing and doing EDA, so we will do that here as well. We will look at the positive and negative results separately at this point. We will not look at the combined data (i.e., both positive and negative together), as we are looking for differences between the groups and need the two sets to be separate.

3.1 Insulin

We will look at insulin first to find the median for those who tested positive and those who tested negative. As we can see from the plot labeled "Insulin of a diabetic person," most patients fall in the range between 0 and 650.

Comparatively, the insulin of a healthy person is typically between 0 and 500, a smaller range of values. It also peaks around 100 and tapers off much faster than the positive results.

As we can see, there is a drastic difference in insulin between a healthy person and an unhealthy one: a healthy person has a median of 102.5, while an unhealthy person has a median of 169.5.

Now we will median-fill the missing values within their respective group (diabetic or healthy).
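Group-wise median filling can be done with `groupby` + `transform`; a sketch on a stand-in frame whose group medians happen to match the 102.5 and 169.5 quoted above:

```python
import numpy as np
import pandas as pd

# Stand-in frame with one missing Insulin value in each class.
df = pd.DataFrame({
    "Insulin": [100.0, 105.0, np.nan, 160.0, 179.0, np.nan],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Fill each group's NaNs with that group's own median, not the overall one.
df["Insulin"] = df.groupby("Outcome")["Insulin"].transform(
    lambda s: s.fillna(s.median()))
print(df["Insulin"].tolist())   # [100.0, 105.0, 102.5, 160.0, 179.0, 169.5]
```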

3.2 Glucose

Now we need to move on to other attributes with missing values and do the same for them.

Below we have plotted the glucose data in the same manner as we did insulin. The positive (diabetic) glucose values are fairly uniform between 75 and 200; they peak around 125 and taper off slowly. Comparatively, the glucose of a healthy person peaks around 100 and tapers off quickly.

3.3 Blood Pressure

We will now move on to Blood Pressure.

Let's take a quick look at the correlation matrix to see where the strongest correlations lie.

The highest correlation is between skin thickness and BMI. This is a moderate correlation, and the next closest is between age and pregnancies.
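The matrix itself is just `DataFrame.corr()`; a sketch on three stand-in columns (the notebook computes it over all eight predictors and plots it as a heatmap):

```python
import pandas as pd

# Stand-in columns; SkinThickness and BMI move together, Age does not.
df = pd.DataFrame({
    "SkinThickness": [35, 29, 23, 45, 19, 47],
    "BMI":           [33.6, 26.6, 28.1, 43.1, 25.6, 45.8],
    "Age":           [50, 31, 32, 33, 30, 26],
})

# Pairwise Pearson correlations; values near +/-1 mean strong linear association.
corr = df.corr().round(2)
print(corr)
```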

4 Machine Learning

In this part we will finally implement models that can predict diabetic cases from medical records. Machine learning has many uses, including classifying data and fitting regressions. We will not use regression, as it is unneeded for our project, but classification is essential.

We will be using: Linear Discriminant Analysis (LDA), Random Forest, and Decision Tree.

4.1 Preparing the data

All learning models need to be trained, so we will use an 80/20 train/test split. We will allocate the Outcome attribute as the expected output and the rest as predictors.
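A sketch of the split with scikit-learn, using synthetic stand-in features of the same shape as the dataset (768 rows, 8 predictors); in the notebook, X is every column except Outcome and y is Outcome itself:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = rng.integers(0, 2, size=768)

# 80% train / 20% test; a fixed random_state keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))   # 614 154
```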

4.2 Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis is a linear model for classification and dimensionality reduction. It is most often used for feature extraction in pattern-classification problems, which fits our analysis well.

LDA is currently one of the most popular classification models and is very good at binary classification. It has some shortcomings: its linear decision boundaries can be ineffective at separating classes that are not linearly separable, where more flexible boundaries are desired.

We can see that both the model's internal accuracy function and 10-fold cross-validation put the accuracy at around 80%.
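A sketch of fitting LDA and scoring it with 10-fold cross-validation; the data here is synthetic and deliberately separable, so the score will be higher than the tutorial's 80%:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Two synthetic classes with shifted means standing in for the diabetes features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1, size=(250, 8)),
               rng.normal(1.5, 1, size=(250, 8))])
y = np.array([0] * 250 + [1] * 250)

lda = LinearDiscriminantAnalysis()
# 10-fold cross-validation, mirroring the accuracy check in the tutorial.
scores = cross_val_score(lda, X, y, cv=10)
print(round(scores.mean(), 3))
```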

4.3 Random Forest

As we can see, the confusion matrix shows fairly good classification. Out of 97 non-diabetics in our test set, we correctly classified 92 and misclassified 5. Out of the 41 diabetic cases, 34 were correctly classified while 7 weren't. As seen below, the accuracy is around 85%, which is very strong.
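A sketch of fitting a random forest and reading off its confusion matrix, again on synthetic stand-in data (so the counts will differ from the 92/5/34/7 above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class stand-in for the diabetes features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1, size=(250, 8)),
               rng.normal(1.5, 1, size=(250, 8))])
y = np.array([0] * 250 + [1] * 250)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

# Rows = true class, columns = predicted class: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_test, pred))
print(round(accuracy_score(y_test, pred), 3))
```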

4.4 Decision Tree

For this learning algorithm we will apply all attributes to the model. In general, a decision tree consists of many paths of nodes originating from the root of the tree. Each path passes through nodes at which a decision is made that determines where to go next.

In our case, we want to classify whether a person does or doesn't have diabetes, so we want the tree to split the input data repeatedly in order to correctly predict which category a data point falls under. We will not use a regression tree, as we do not need to predict a value for an attribute but simply to determine whether a given set of values indicates diabetes.

As you can see, this decision tree achieves 85 percent accuracy. We used a max depth of 4, which seemed optimal; as shown below, higher depths seem to overfit the data.
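A sketch of the depth comparison on synthetic stand-in data: unlimited depth drives training accuracy to 1.0, the classic overfitting signature (on the real data, test accuracy drops as depth grows):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class stand-in for the diabetes features.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1, size=(250, 8)),
               rng.normal(1.5, 1, size=(250, 8))])
y = np.array([0] * 250 + [1] * 250)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Compare shallow vs unlimited depth: train accuracy keeps climbing,
# which is the overfitting pattern to watch for.
for depth in (2, 4, 8, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(depth,
          round(tree.score(X_train, y_train), 3),
          round(tree.score(X_test, y_test), 3))
```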

Conclusion

After viewing the graphs, we can safely say that higher glucose values are a strong indicator of diabetes. The other attributes matter too, but not on the level of glucose. We have seen that when insulin passes a certain point, there is a strong likelihood of being diabetic. A high number of pregnancies and higher age are also good indicators.

As we can see, random forest is by far the most accurate model, with the decision tree the next highest. Though our dataset wasn't very large (768 data points), we can apply our model to the real world and confidently assess a person's likelihood of being diabetic.